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ABSTRACT 

Branching processes model the evolution of populations of agents 
that randomly generate offsprings. These processes, more patently 
Galton-Watson processes, are widely used to model biological, so- 
cial, cognitive, and technological phenomena, such as the diffusion 
of ideas, knowledge, chain letters, viruses, and the evolution of hu- 
mans through their Y-chromosome DNA or mitochondrial RNA. 
A practical challenge of modeling real phenomena using a Galton- 
Watson process is the offspring distribution, which must be mea- 
sured from the population. In most cases, however, directly mea- 
suring the offspring distribution is unrealistic due to lack of re- 
sources or the death of agents. So far, researchers have relied on 
informed guesses to guide their choice of offspring distribution. In 
this work we propose two methods to estimate the offspring dis- 
tribution from real sampled data. Using a small sampled fraction 
of the agents and instrumented with the identity of the ancestors 
of the sampled agents, we show that accurate offspring distribu- 
tion estimates can be obtained by sampling as little as 14% of the 
population. 

1. INTRODUCTION 

Branching processes, more markedly Galton-Watson (GW) pro- 
cesses, have been used to model a variety of phenomena, ranging 
from human Y-chromosome DNA and mitochondrial RNA evolu- 
tion 1 5 1, to epidemics on complex networks |6|, to block dissemi- 
nation in peer-to-peer networks |j8J. The GW process can be rep- 
resented as a growing tree, where agents are nodes connected to 
their offspring by edges. The number of offspring is a random vari- 
able associated with a distribution function. An example of a GW 
branching process is a family tree considering either only the fe- 
males or only the males in the family (which represent the transmis- 
sion of mitochondrial RNA or Y-chromosome DNA, respectively). 
A GW process is completely characterized by its offspring distribu- 
tion. A practical challenge when modeling real world systems from 
a GW process is knowing the offspring distribution of the process, 
which must be measured from the population. 

In most applications, however, directly measuring the offspring 
distribution is unrealistic due to the lack of resources or the inac- 
cessibility of agents (e.g. death). It is not reasonable to assume 
that one can collect genetic material from the entire human popula- 
tion or that in the branching process of chain letter signatures (see 
Chierichetti et al. 1 3 1 for further details), one may collect all pos- 
sible branches of the chain letter created by forwarding the letter. 
So far, researchers have relied on informed guesses to guide their 
choice of offspring distribution. 

In this work we propose a collection of methods to estimate the 
offspring distribution from real sampled data. Our goal is to accu- 
rately estimate the offspring distribution by sampling and coUect- 
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ing ancestors ids of a small fraction of the agents. We study the 
case where a sampled agent reveals the identity of its ancestors and 
the trees are generated in the supercritical regime (i.e., average off- 
spring > 1) when the maximum offspring and the maximum tree 
height are upperbounded by a (possibly large) constant. We show 
that accurate offspring distribution estimates can be obtained by 
sampling as little as 14% of the population. 

A related problem is characterizing graphs using traceroute sam- 
pling. Traceroute sampling from a single source can be thought as 
sampling a tree where nodes have different offspring (degree) dis- 
tributions depending on their position with respect to the source. 
This is an important well known hard problem 1 1 4| and it remains 
open to date. Our results have the added benefit of shedding some 
light also into the traceroute problem. 

The outline of this work is as follows. Section [2] describes the 
network and sampling models. In Section [3] we first show how to 
estimate the offspring distribution through exact inference, show- 
ing it does not scale. We then propose an MCMC method of per- 
forming approximate inference that works for small and medium 
sized trees (up to 2,000 nodes). In Section [4] we evaluate both 
methods using a set of 900 syntethic datasets, comprising small and 
medium trees. For small trees, exact inference yielded accurate es- 
timates and outperforms the approximate estimator. On the other 
hand, approximate inference can handle larger trees, while obtain- 
ing significant improvement over more naive approaches. Finally 
Section[5]presents our conclusions and future work. 

2. MODEL 

We assume that the underlying tree comes from a Galton-Watson 
(GW) process. The GW process models the growth of a population 
of individuals that evolves in discrete-time (n — 0, 1, 2, . . . ) as 
follows. The population starts with one individual at the 0-th gen- 
eration (n = 0). Each individual i at the n-th generation produces 
a random number of individuals at the (n + l)st generation, called 
offspring. The offspring counts of all individuals are assumed to be 
i.i.d. random variables. An instance of the GW process is therefore 
described by a sequence of integers which denote the number of 
individuals at each generation. 

Formally, the GW process is a discrete-time Markov Chain 
{X„}n^i, where L is the number of generations, given by the fol- 
lowing recursion 

i = l 

with Xo = 1, where the v/"' > are i.i.d. random variables 
with distribution G — {6o, . . . , Ow), Vi, n > 1, where W is the 
maximum number of offspring of an agent. The GW process can 
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Figure 1: (a) Branching process tree, (b) Samples with 2 and 3 
targets, respectively. 



be seen as a generative process of a tree G = {V, E), where X„ is 
the number of nodes at the nth generation and is the offspring 
count of the ith node at the nth generation. For simplicity, we 
assume that 6^0 = and that the number of generations is fixed, so 
that all tree leaves sit exactly at generation L. Our results, however, 
can be easily adapted to the case where 6o > and the leaves have 
different levels. But the above assumptions lead to a simpler model 
in the sense that we can have average offspring greater than one 
without worrying about infinite trees. 

Since the numbers of offspring are mutually independent, the 
probability of a given tree G is 



p(Gi0)=n^? 



(1) 



where Cj — J]]^ — j} is the number of nodes with 

offspring count j. Fig. [la] depicts an example of tree generated 
from = (0.3,0.6,0.1)with L = 3. In this case, P(G|0) = 
0.3^ • 0.6^ = 0.108. 

Sampling Model 

A node is said to be observed when the sampling process explic- 
itly reveals its presence in the original graph (e.g. node look up is 
performed or node spontaneously advertise its presence). The ob- 
served path, however, consists of the observed node and its path to 
the root. A sample is a set of observed paths. 

Let V' C V he a set of randomly observed nodes of the unla- 
beled graph G. Let 5* be the sampled tree formed by the union of 
the path s from all nodes v £ V' to the root of G. For instance. 
Fig. |lb| shows sampled trees Si formed by V' = {a, b} and S2 
formed by V' = {o, b, c}. We assume that nodes in V' are sam- 
pled from V with probability p. 

We now show how to compute P{S\G), assuming that that V' is 
known, i.e., we know which nodes in S are observed. However, it 
is easy to modify the following analysis to the cases where (1) we 
only know \V'\ or (2) we know the topology of S, but not which 
or how many nodes are observed. Let Cg,s be the number of ways 
in which 5 can be mapped onto G. Clearly, Cg,s = if S is not 
a subgraph of G. Conditioning on a given mapping, we must have 
exactly jl^'j nodes chosen as targets and j V \ V'| not chosen as so. 
Therefore, 



\v\v'\ 



(2) 



Computing Gg,s can be done recursively by first computing Cij, 
the number of ways the i-th subtree connected to the root of 5* can 
be mapped to j-th subtree connected to the root of G, for all 
Now consider the matrix C — [cij]nxm- If we define the operator 



Em 



n > 1 
n — I, 



(3) 



Figure 2: Graphical model representing network generation 
and sampling. White nodes are unobservable and shaded node 
is observable. 



and hence. 



where Cij is C after removal of the 1st row and jth column, then 
we can show that Gg,s = jC|. Consider the simple case of G and 

S2 shown in Fig. lb Here we have C = ^ 

|C| = l- l + l- 2 = 3. We can visually check that this is indeed 
the number of ways to map S2 onto G. Therefore, P{S2\G) = 
Sp'il-Pf- 

Inference on the structure of the tree G from the partial obser- 
vation S is possible because we can compute P{S\G') for any G'. 
This, in turn, allows us to do inference on the offspring distribution 
by considering how likely G' is to be generated from 6 by using 
P{G'\6) and weighting by how likely S is to be sampled given G'. 
In the next section we propose two estimation methods based on 
this idea. 

3. ESTIMATORS 

We consider the problem of estimating the offspring distribution 
G of the GW process that generates a tree G given a sample S 
consisting of the union of random observed paths when nodes are 
observed with probability p. 

Two approaches to this problem based on Maximum Likelihood 
Estimation are proposed in this paper. While the former consists 
of the exact computation of the likelihood function P{S\0), the 
latter approximates this function via Metropolis-Hastings with im- 
portance sampling. 

3.1 Exact inference 

The graphical model in Fig. |2] depicts the statistical relationship 
between S, and G. The shaded node, S, is the only observable 
variable, while the white nodes, 6 and G are unobservable. This 
figure shows that to find the relationship between S and 9, we have 
to sum over the variable G, i.e., over all possible trees given the 
number of generations L and the maximum degree W . Let Ql,w 
be the set of all possible trees given L and W . It follows that 

= J2 P{S\G,e)P{G\0) 

GeSL.iv 

= J2 PiS\G)PiG\e), (4) 

where from line 2 to line 3 we use the fact that S is conditionally 
independent on given G (see Fig.[2](. However, \Ql,w\ grows 
exponentially both in L and W, which limits this approach to very 
small trees. In fact, we can show that 



L = 1 
L>1. 



X^i^i \Sl-i,w\^ > \Gl-i,w\ 
rsion yiek 

log' ' is the repeated logarithm. Note however that isomorphic 



Solving the recursion yields log(-^"'' \gL,w\ = 0{W), where 



L 


1 


2 


3 


4 


5 


\Ql,z\ « 


3 


39 


6 X 10* 


2.3 X 10^" 


1.2 X 10*^ 




3 


19 


1.5 X 10^ 


6.1 X 10** 


3.8 X 10^^ 



Table 1: Growth of the space of trees as a function of L, for 
W = 3. 



trees are being counted more than once. Therefore, we can reduce 
the computational cost by counting only non-isomorphic trees (ap- 
propriately weighted by their multiplicity). 

Let be the maximal set of non-isomorphic trees of Ql,w- 



It is possible to show that 

'w, 



I /-> non-ISO I 

\yL,w I 



non-ISO I 
1, W I 

W + 1 



L = 1 
1, L>1. 



Table 3 . 1 illustrates some values of | , w | and 1 5™ ° j f or = 3 
and L = 1, . . . , 5. As we can see, counting only non-isomorphic 
trees reduces significantly the state space, but it is still not feasible 
to compute eq. (|4j except for rather small numbers such 2&W = 
3 and L = 4. Nevertheless, we utilize this approach to perform 
inference more efficiently. In the following, we explain how to 
enumerate trees in and how to compute their multiplicities. 

Counting only non-isomorphic trees 
A straightforward way to enumerate all trees in is: (1) to 

enumerate non-isomorphic trees in C/™.!!.! "^k and assign a numeric 
id to each of them; and (2) construct trees in Q'i^'"w° by attaching 
to a root node trees from GT-i°w where ids are in non-increasing 
order Note that two trees are isomorphic in this construction if the 
sets of ids of the subtrees connected to the root node are permuta- 
tions of each other, which cannot occur due to the ordering. 

In what follows we compute the probability that sample S is ob- 
served given the offspring distribution 9 through the enumeration 
of non-isomorphic trees. Let m^^' denote the multiplicity of the 
i-th tree, say d, in the labeled space GT^w" ■ Eq- j4j is equivalent 
to 



(5) 



The multiplicity m- can be calculated from the ids of subtrees 
directly connected to the root node in d and their multiplicities. 
More precisely, m'^' is simply the number of permutations of the 
ids multiplied by the product of the multiplicities of each subtree. 
For instance, if there are j subtrees connected to the root with dis- 



tinct ids (1), . . . , (j), then ml ' = j\ x Yll=i 
general case, where ids can appear more than once, we have 



AL-l) 

\fe) 



In the 



i! X n 



k=i ' 



n: 



ELi Hid = k} 



The first estimator we propose is 



0Exact = argmaxP(5|0) 



(6) 



where P{S\0) is computed as in (|5}. 
Maximum Likelihood Estimation 

After obtaining a sample, we write the summation in Eq. ([5]l as a 
function of 0. Unfortunately, this likelihood function is a sum of a 



potentially enormous number of terms and using the log-likelihood 
is not helpful in this case. We apply several tricks to solve this 
optimization task. 

One simple trick to reduce the number of terms consists of group- 
ing together trees that have the same configuration in terms of off- 
spring counts, i.e., that account for the same P{G\6). Note that 
there are many such trees even when considering non-isomorphic 
trees only, although they correspond to different values of P{S\G). 

Also note that this is a constrained maximization problem. Since 
is a probability distribution, < 6i < 1 for z £ {1, . . . , and 
X^i^i = 1- We can turn it into a non-constrained maximization 
problem by replacing Oi = 2_i for i g {1, . . . , W} where Z = 

X^i^i fi"'' setting aw = 1 (for regularization purposes) and then 
maximizing w.rt. ex. Note that can now assume any value in 
R for i £ {1,...,VK — 1}. Nevertheless, one must be careful 
when using this parameter transformation since the products of the 
exponentials can quickly lead to overflows. Therefore, we use log 
representation and the logsumexp trick. 

After this transformation, the maximization problem becomes 



maxZ(a) —} c 

ry ^ — ^ 



where Cj is the sum of the coefficients of the terms corresponding to 
the same j-th configuration of P{G\9), Xji is the number of nodes 
with offspring i in the j-th configuration and yj = X]i=i ^ji- In 
order to compute the likelihood function and its gradient more effi- 
ciently, we express them in matrix notation as 

^(q) = ■ (exp(Xa)/Z^) 

Vl{a) = exp{Xa)/Z^ - exp(Xa)/Z^+^ 

where c = [cj], X = [xj^], a = [oi], y = [y-j], Z^ = [Z^^], the 
"/" symbol corresponds to division of two vectors element-wise 
and 1 is a column vector with all entries equal to 1 . 

The maximization then goes as follows. We sample 10, 000 
points uniformly from R^~^. The one with the maximum value 
of will be a'"', the starting point to be used with the BFG9j 
(limited to 100 iterations, relative convergence tolerance of 10~ , 
step size 10^^). The estimate can be obtained from a by expo- 
nentiating and then normalizing the latter 

3.2 Approximate inference with MCMC 

The previous approach only applies to small problems due to the 
enormous number of terms in the summation (|5j. To solve larger 
problems, we approximate eq. (|4]l using MCMC. 

Let h = P{S\G) and /(G) = P{G\0). Since /(G) defines a 
probability distribution on the space it follows that 



PiS\0)^Y.^'f(G) = Ef[h]- 



(7) 



where Ef[.] denotes expectation w.rt. distribution /. 

Monte Carlo simulation approximates expectations (integrals, more 
generally) by sampling from a desired distribution / [2 |. The prob- 
lem here is that we cannot sample from / because we don't know 0. 
However, we can sample from some other distribution g and com- 
pensate for the fact that in g some trees are more (or less) likely to 
appear than in / by using importance sampling. More precisely. 



P{S\0) = J2 hfiG) = E ^^9iG) = E: 



/(G) 



(8) 



'We use R implementation in package stats. 



Recall from Section |2] that we can generate trees using the GW 
process from a given offspring distribution 60 ■ Hence we can set 



g{G) = -P{s\G)P{G\eo) 



(9) 



where Z is a normalizing constanj^ Substituting eq. |9j into ^ 
yields 



P{S\0) = E, 



P{S\G)P{G\0) 



[^PiS\G)P{G\0o) 



Z ^ P{G,\e) 



'It m|0o)' 



where Gi ~ g{G). Note that ^ is not a function of 6 and do not 
need to be considered when maximizing 6. Therefore, the second 
estimator we propose is 



e 



Approximate 



P(G,|6>) 



I — 1 ^ ' ' 



(10) 



In order to draw Gi ~ g{G), we use the Metropolis-Hastings 
algorithm where each state Xj of the Markov Chain is a tree. We 
start the chain in a state Xq consistent with 5", in particular, we set 
Xo = S. The transition kernel Xi — >■ Xi+i we use is shown in 
Algorithm^ The new tree Xi+i is accepted with probability 

Algorithm 1 Transition Kernel(Xi, Xi+i) 

V -ir- internal node selected uniformly at random from Xi 
dy <r- degree(?;) 
if d„ = 1 then 

action <— add 
else itd^ = W then 

action remove 
else > 1 < d„ < W 

if [7(0, 1) < 0.5 then > [7(0, 1) is the uniform dist. 

action ^ add 

else 

action remove 
end if 
end if 

if action = add then 

r„ ^ GaltonWatson(0o, ^ - 

v.child[dv + 1] Ti, > adds new branch 

dv <^ dy + 1 
else if action = remove then 

shuffle(u.c/ij/d) t> shuffle children 

v.child[dv] <— nil > removes "right-most" branch 

end if 



p{s\x,+i)P{x,+i\eo)q{x,^ 



XA 



p{s\x,)P{x,\eo)q{x, ^ x,+i) 



(11) 



where q{Xi — )■ Xj) is the probability that the transition kernel 
proposes transition Xi — >■ Xj . It is easy to include the calculation 
of q(Xi — )■ Xi+\) and q{Xi+\ — >■ Xi) in the transition kernel 
implementation. In particular, let Ni and Li denote the number of 
nodes and leaves in Xi, respectively. Hence, if action = add. 



q{X^ 



^X,) = 



Q 5i{d„>i} ^ p(j;|gQ) 

N,-L,-l 
5i{d„+i<H^}(^^^^)-i 



N, 



+1 



^We could have set g{G) — P{G\Oo) instead, but our approach 
restrict us to generating trees that are consistent with the sample 
and thus, is more efficient. 



otherwise, 

q{Xi ^ Xi+i) = 

q{Xi+i -5> Xi) = 



5i{d„<w}^-i 
N,-L,-l ' 



+1 



where 0.5^^'*"^^^ accounts for the fact that if v has degree > 1, 
action add is chosen with probability 0.5, but when d„ = 1, add is 
always chosen. The case for remove is similaj^ 

Maximum Likelihood Estimation 

After obtaining roughly independent samples Gi ~ (;(G), we write 
the summation in the RHS of eq. ^ and perform maximization as 
in the case of exact inference. 

4. EXPERIMENTS AND RESULTS 

We first describe the experiments used to assess the performance 
of the two estimation methods, henceforth referred to as EXACT 
and Approximate, respectively. We then compare methods w.r.t. 
the KL-divergence of the estimated distribution from 6. In addi- 
tion, we show some results in detail to illustrate the Mean Squared 
Error (MSB) per distribution parameter and how performance in- 
creases with the sampling probability. In general, EXACT performs 
best but is only feasible for small datasets. Nevertheless, APPROX- 
IMATE exhibits comparable performance and can cope with larger 
datasets (up to 2,000 nodes). 

4.1 Experiments description 

Based on the size of Ql,w define two classes of estimation 
problems: small and medium size problems. For medium size ones, 
we would like to compare the methods' performance for short and 
long tail offspring distributions, hereby represented by truncatecj^ 
Poisson and Zipf distributions, respectively. Parameters of these 
distributions were chosen so that their average is d. 

In what concerns the sampling process, we choose three sam- 
pling probabilities representing low, medium and high sampling 
rates for each class. The set of values of p has to be different for 
each class for two reasons. The practical reason is that as the tree 
size grows, the cost to sample it grows linearly on p and we may be 
limited by a budget. The second reason is that, if there is no such 
constraint, while values of p such as 0.5 are reasonable for small 
problems, they will likely reveal all nodes from the top levels for 
large problems. Hence, taking the empirical distribution from the 
first levels per se would be an accurate estimator. Inside each class, 
consider the following distributions and sampling probabilities; 

1. Small size: = 3, L = 3, d = 2.1 

• 6»(i) = (0.2,0.5,0.3) 

• p e {0.1,0.2,0.5} 

2. Medium size: = 10, L = 5, d = 3.15 

• 0^'^'> ~ truncated Poisson(A = 3) 

• e*^) ~ Zipf(a = 1.132, iV = 10) 

• pG {0.5,1.0,5.0} X 10-2 

Average tree sizes per class are ~ 17 and ~ 454, respectively. 

In order to test the inference methods, we build a set of estima- 
tion problems as follows. For each distribution i = 1, . . . , 3, 

^^AU calculations should be performed in log space. 
'*Here truncated means that we took the original probability mass 
function for values between 1 and W and normalized by their sum, 
while setting the probability mass of other values to zero. 
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Figure 3: Boxplots of the MSB per parameter for 0^^\ 

we generate 10 trees tij, j = 1, . . . , 10 from a GW process with 
height L + 1 (30 trees in total). Next, for each of the 90 pairs 
(tij,pik), = 1, 2, 3, we generate 10 samples Sij^;, I = 1, . . . , 10 
(900 samples in total). 

We assume each sample Sij^i constitutes a separate estimation 
problem (also referred to as dataset to avoid confusion with MCMC 
samples). This can be interpreted as if we had one tree (originated 
from the GW process), and a single opportunity to sample it. No 
other samples can be obtained from the same tree, nor other trees 
are available for sampling. Ideally, we would like to try both meth- 
ods with each problem, but EXACT is only feasible for small prob- 
lems. Before presenting the results, we briefly discuss implemen- 
tation issues related to APPROXIMATE. 

4.2 Implementation issues of APPROXIMATE 

The main difficulty in the APPROXIMATE method is knowing 
when to stop the approximation as, without knowing the true distri- 
bution, we need a mechanism that tells us how close we are to the 
steady state distribution of the Markov chain. 

Recall that we use the Metropolis Hastings (MH) algorithm to 
sample graphs from g[G) (see Eq. l[8j). As with any MCMC method, 
three questions must be addressed: (1) How long should the burn- 
in period be? (2) What should the thinning ratio be? (3) What is 
the minimum number of uncorrelated samples that we need? We 
use the Raftery-Lewis (RL) Diagnostic |7| to address these issue|^ 

The RL Diagnostic attempts to determine the necessary condi- 
tions to estimate a quantile q of the measure of interest, within a 
tolerance r with probability s. We take the likelihood of the MH 
samples as the measure of interest. The diagnostic was then applied 
individually to each dataset with default parameters (g = 0.025, 
r = 0.005 and s = 0.95). Results concerning the bum-in period 
and thinning ratio are subsumed by the required number of sam- 
ples and hence will be ommited. The minimum number of MCMC 
samples for small datasets was less than 50,000 graphs and for 
medium datasets, less than 500,000, in summary. We conducted 
some experiments with more MCMC samples than those values, 
but there was no significant improvement w.r.t. the estimation ac- 
curacy. Therefore, the results described in the following refer to the 
minimum number of samples suggested by the Raftery-Lewis test. 

Last, recall from Section |3.2| that 0o can be any distribution. 
However, the closer it is to 6, the better is the convergence of the 
MCMC. When estimating the offspring distribution in medium size 
problems, we will assume that 6o is binomial and set its parame- 
ters so that the average is d. This implies assuming that the average 
number of offspring can be estimated, but in fact a rough estimate 
can be obtained by simply taking the average of the observed node 
degrees from the first generations in the sample, whose edges have 
a relatively high probability of being sampled. For small sized tree, 
we simply set Oo to be uniform. 



4.3 Results 

The estimation results span over a number of dimensions equal 
to the number of parameters assumed in the multinomial distribu- 
tion. We use the KuUback-Leibler (KL) divergence as an objective 
criterion to compare the estimation methods in a single dimension. 

Let the estimated offspring distribution be = {6\, . . . , 9w)- 
The KL-divergence of 6 from 6 is defined by 

w 

Dkl(6»||0) = ^(loge. - log^O^^, (12) 

"■We use the R implementation in package coda. 

when 6i > 0, i = 1, ■ • ■ , W . When this condition does not al- 
ways hold, as in our case, absolute discounting is frequently used 
to smooth 0. Hence, we distribute e = 10~^ of probability mass 
among the zero estimates, discounting this value equally from the 
non-zero estimates. 

Table |2] shows the median KL-divergence obtained for each set 
of problems (indexed by 0*^"', i = 1, . . . , 3), for EXACT and AP- 
PROXIMATE, when the sampling probability p is medium. Dashes 
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2.98 


0.58 


0.78 



Table 2: Median KL-divergence of estimators 

indicate that EXACT could not find estimates for medium size prob- 
lems in a reasonable amount of time. However, it outperfomed 
Approximate in the estimation of 0*^'. Note that although KL- 
divergence implies some ordering within each column in terms of 
accuracy, neither the relative ratios have a direct interpretation, 
nor values accross different columns can be compared. We will 
next evaluate the results w.r.t. the MSE of each parameter estimate, 
which will allow us to conclude that the performance of APPROX- 
IMATE is in fact very close to the one of EXACT for small datasets. 

The effect of sampling probabilities 

As we increase p, we gather more information about the original 
graph and hence estimators will clearly perform better. We study 
the performance gains w.r.t. the MSE of the parameter estimates. 

Figs.jsja-b) show boxplots of the MSE of the estimates 6i, i = 
1,. . . ,W obtained by EXACT and APPROXIMATE, respectively, 
for datasets coming from 0^^\ Each boxplot shows minimum, 1st 
quartile, median, 3rd quartile and maximum values, computed over 
100 estimates (10 samples for each of the 10 trees). Colors corre- 
spond to different sampling probabilities. In both cases, the median 
MSE increases as we decrease p, as expected. 

Similarly, Fig. |4] shows the results obtained by APPROXIMATE 
for datasets that come from 0^^\ In general, increasing the sam- 
pling probability reduces the MSE, but not by a significant amount. 
Results for 0^'^^ are similar and will be ommitted. 

We conjecture that most of the information that allows us to es- 
timate 9 comes from the top levels of the tree. As we increase p, 
we obtain many more observations from the bottom levels of the 
tree, but only a few new observations from the top levels. While 
edges closer to the root are observed with higher probability, edges 
from lower levels are more rarely sampled and there is much more 
uncertainty in those samples. This implies that increasing p should 
not improve the estimates significantly after a certain point. 
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Figure 4: Boxplots of the MSE of Approximate for 6 
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Figure 5: Median MSE of Approximate and empirical esti- 
mates for 



This short digression might lead the reader wonder whether the 
values of p we use would sample so many edges from the top lev- 
els that would be enough to take the empirical distribution of the 
observed degrees at those levels as an estimate for G. Hence, we 
compare the MSE results for APPROXIMATE with the empirical 
distribution of the observed degrees from the top 1 , 2 and 3 levels 
in a cumulative fashion. Intuitively, the empirical distribution is bi- 
ased towards smaller degrees, especially if lower levels are taken 
into account, this being the reason why we stop at 3 levels. 

Fig.|5]shows the median values of the MSE (also seen in the pre- 
vious figure), but only for "small" and "large" p values, for the sake 
of clarity. In addition, dashed lines display the median MSE ob- 
tained when the empirical distributions are used as estimators. Es- 
timates forp = 5 X 10^'^ exhibit a one-order magnitude gain in ac- 



curacy (for most parameters) relative to the best empirical estimate, 
but estimates forp = 5 x 10"^ only yield significant improvements 
at the tail of the distribution. In general, empirical distributions are 
not good estimates, especially for distribution tails due to its bias 
towards small degrees. One exception we found was in the case 
of 0'^', where the probability mass at the tail is so large that high 
degree nodes are likely to be observed at the top levels. However, 
we observed in additional experiments that this is not the case for 
long tailed distributions with larger support, such asW = 100. 

5. CONCLUSIONS 

In this paper we propose and analyze two methods to estimate 
the offspring distribution of a branching process from a sample of 
random observed paths to the root. The former, based on exact in- 
ference, is limited to small problems since the number of terms to 
be computed in the likelihood function grows exponentially with 
the maximum degree and number of levels. The latter, approxi- 
mates the likelihood function using MCMC samples, and was able 
to handle both small and medium size problems. For small prob- 
lems, its performance was similar to that of exact inference. 
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